X-tract: Structure Extraction from Botanical Textual Descriptions
Authors
1 Center for Research in Information and Automation Technologies
Abstract
Most information available today, whether in printed books or digital repositories, takes the form of free-format text. Retrieving information from these ever-growing repositories has become a challenge for information retrieval (IR) researchers. In some fields, such as Botany and Taxonomy, textual descriptions follow a set of rules and use a relatively limited vocabulary. This makes botanical textual descriptions an interesting area in which to explore IR techniques for finding structure and facilitating semantic analysis. This paper presents X-tract, a solution to the problem of text analysis and structure extraction in a specific application domain, namely floristic morphologic descriptions. The solution demonstrates the potential of using a grammar to determine information structure in a botanical digital library. We have developed a prototype based on this approach: given an HTML or plain-text document, X-tract analyzes it and presents the results to the user, who can verify the proposed structure before the database is updated. This transformation is also useful when storing morphologic descriptions in a database with a pre-established format. The solution is implemented in the context of the Floristic Digital Library (FDL), a large digital library project comprising a wide variety of botanical documents, formats and services.
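To make the idea of grammar-driven structure extraction concrete, the following is a minimal sketch in Python. It is not X-tract's grammar or implementation, which the abstract does not specify. The toy grammar, the function names `parse_property` and `parse_description`, the field names in the output, and the sample sentence are all assumptions chosen for illustration. The sketch relies on the observation made above that botanical descriptions follow regular conventions: clauses separated by semicolons, each naming an organ followed by comma-separated characters.

```python
import re

# Hypothetical toy grammar (an assumption, not X-tract's actual grammar):
#   description := clause (";" clause)*
#   clause      := organ property ("," property)*
#   property    := measurement | descriptor
#   measurement := number ["-" number] unit [dimension]
MEASUREMENT = re.compile(
    r"^(?P<min>\d+(?:\.\d+)?)"               # lower bound, e.g. "3" or "1.5"
    r"(?:\s*-\s*(?P<max>\d+(?:\.\d+)?))?"    # optional upper bound, e.g. "-5"
    r"\s*(?P<unit>mm|cm|m)"                  # unit of length
    r"(?:\s+(?P<dimension>long|wide|tall))?$"  # optional dimension word
)


def parse_property(text):
    """Classify one comma-separated item as a measurement or a descriptor."""
    match = MEASUREMENT.match(text)
    if match:
        low = float(match.group("min"))
        # A single value such as "1.5 cm" is treated as a degenerate range.
        high = float(match.group("max")) if match.group("max") else low
        return {
            "type": "measurement",
            "min": low,
            "max": high,
            "unit": match.group("unit"),
            "dimension": match.group("dimension"),
        }
    return {"type": "descriptor", "value": text}


def parse_description(description):
    """Map each organ named in the description to its list of properties."""
    structure = {}
    for clause in description.strip().rstrip(".").split(";"):
        clause = clause.strip()
        if not clause:
            continue
        # The first word of a clause is assumed to name the organ.
        organ, _, rest = clause.partition(" ")
        properties = [parse_property(p.strip()) for p in rest.split(",") if p.strip()]
        structure.setdefault(organ.lower(), []).extend(properties)
    return structure


if __name__ == "__main__":
    sample = "Leaves alternate, ovate, 3-5 cm long; flowers white, 1.5 cm wide."
    for organ, properties in parse_description(sample).items():
        print(organ, properties)
```

The resulting dictionary is the kind of proposed structure a user could review before it is written to a database. A real system would need a much richer grammar, for example multi-word organ names, nested sub-organs, and HTML input, none of which this sketch handles.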